This is a notebook explaining the mathematics of Linear and Logistic Regression. It is adapted from my blog. All of the code is available on GitHub.
In Regression analysis we start from a dataset $ X $ consisting of $ n $ datapoints:
$ X = ( x^{(1)}, x^{(2)}, \ldots, x^{(n)} ) $.
For each of these (input) datapoints there is a corresponding (output) $ y^{(i)}$-value.
Here, the $ x$-datapoints are called the independent variables and $ y$ the dependent variable; the value of $ y^{(i)} $ depends on the value of $ x^{(i)} $, while the value of $ x^{(i)}$ may be freely chosen without any restriction imposed on it by any other variable.
The goal of Regression analysis is to find a function $ f(X) $ which can best describe the correlation between $ X $ and $ Y $. In the field of Machine Learning, this function is called the hypothesis function and is denoted as $ h_{\theta}(x) $. If we can find such a function, we can say we have successfully built a Regression model.
If the data lives in a 2D space, this boils down to finding a curve that fits through the data points. In the 3D case we have to find a plane, and in higher dimensions a hyperplane.
To give an example, let's say that we are trying to find a predictive model for the success of students in a course called Machine Learning. We have a dataset $ Y $ which contains the final grade of $ n $ students. Dataset $ X $ contains the values of the independent variables. Our initial assumption is that the final grade only depends on the studying time. The variable $ x^{(i)} $ therefore indicates how many hours student $ i $ has studied. The first thing we would do is visualize this data:
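For example, with stand-in arrays for the studying hours and the corresponding grades (purely illustrative values, chosen here only to show the plotting step), this first visualization could look like:
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# stand-in data for illustration only
rng = np.random.default_rng(0)
hours_studied = rng.uniform(0, 10, 50)
final_grade = 3.0 + 0.6 * hours_studied + rng.normal(0, 0.5, 50)

plt.scatter(hours_studied, final_grade)
plt.xlabel('hours studied')
plt.ylabel('final grade')
plt.show()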
If the result looks like the figure on the left, we are out of luck: the points are distributed randomly and there is no correlation between $ Y$ and $ X$ at all. However, if it looks like the figure on the right, there is probably a strong correlation and we can start looking for the function which describes this correlation.
This function could for example be:
$ h_{\theta}(X) = \theta_0+ \theta_1 \cdot x $
or
$ h_{\theta}(X) = \theta_0 + \theta_1 \cdot x^2 $
where $ \theta $ are the parameters of our model. If there is more than one independent variable, the hypothesis function could for example be
$ h_{\theta}(x) = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2$
or
$ h_{\theta}(x) = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2^3 $.
So, what makes linear regression linear is not that $ Y$ depends in a linear way on $ X $, but that it depends in a linear way on $ \theta$. $ Y $ needs to be linear with respect to the model-parameters $ \theta $. Mathematically speaking it needs to satisfy the superposition principle.
Examples of nonlinear regression would be:
$ h_{\theta}(x) = \theta_0 + x_1^{\theta_1} $
or
$ h_{\theta}(x) = \theta_0 \cdot \exp(\theta_1 \cdot x_1) $
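To make this distinction concrete (a short check added here for clarity): a hypothesis is linear in $ \theta $ precisely when none of the partial derivatives $ \frac{\partial h_{\theta}}{\partial \theta_j} $ depend on $ \theta $ itself. For example,
$ \frac{\partial}{\partial \theta_1} \left( \theta_0 + \theta_1 \cdot x_1^2 \right) = x_1^2 $,
which contains no $ \theta $ (linear regression), whereas
$ \frac{\partial}{\partial \theta_1} \left( \theta_0 + x_1^{\theta_1} \right) = x_1^{\theta_1} \ln(x_1) $,
which still depends on $ \theta_1 $ (nonlinear regression).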
To find the best values of $ \theta $, we first need a measure of how well a given hypothesis fits the data. For Linear Regression this is the squared-error cost function
$ J(x) = \frac{1}{2n} \sum_i^n ( h_{\theta}(x^{(i)}) - y^{(i)} )^2 $
The best-fitting parameters are the ones that minimize this cost function. Gradient descent finds them iteratively: starting from an initial guess for $ \theta $, the parameters are updated at each step by
$ \Delta\theta = - \alpha \frac{d}{d\theta} J(x) $.
The reason for this is that a function's value decreases fastest if we move in the direction of the negative gradient (the directional derivative is maximal in the direction of the gradient).
Taking all this into account, this is how gradient descent works:
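As an illustration, a minimal sketch of this procedure for the simple linear hypothesis $ h_{\theta}(x) = \theta_0 + \theta_1 \cdot x $ could look as follows (the arrays x and y holding the data are assumed to be given as numpy arrays; this is not the exact code used later in the notebook):
In [ ]:
import numpy as np

def gradient_descent_linear(x, y, alpha=0.01, max_iter=1000):
    """Minimal gradient descent for h_theta(x) = theta_0 + theta_1 * x."""
    n = len(y)
    theta0, theta1 = 0.0, 0.0              # initial guess of the parameters
    for _ in range(max_iter):
        h = theta0 + theta1 * x            # predictions of the current hypothesis
        # gradient of J = 1/(2n) * sum (h - y)^2 with respect to theta_0 and theta_1
        grad0 = np.sum(h - y) / n
        grad1 = np.sum((h - y) * x) / n
        theta0 -= alpha * grad0            # step in the direction of the negative gradient
        theta1 -= alpha * grad1
    return theta0, theta1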
Just as important as the initial guess of the parameters is the value you choose for the learning rate $ \alpha $. This learning rate determines how fast you move along the slope of the gradient. If the selected value of this learning rate is too small, it will take too many iterations before you reach your convergence criteria. If this value is too large, you might overshoot and not converge.
In order to understand the hypothesis function of Logistic Regression, we must first understand the idea behind classification.
When you want to classify something, there is a limited number of classes it can belong to. For each of these possible classes there are only two possible states for $ y^{(i)} $: either the datapoint belongs to the specified class and $ y^{(i)} = 1 $, or it does not belong to the class and $ y^{(i)} = 0 $. Even though the output values $ Y $ are binary, the independent variables $ X $ are still continuous. So we need a function which takes the continuous variables $ X $ as input and maps each datapoint to a value between 0 and 1 that can be interpreted as class membership. This function, the hypothesis function, has the following form:
$ h_{\theta}(x) = \frac{1}{1 + \exp(-z)} = \frac{1}{1 + \exp(-\theta x)} $.
This function is also known as the logistic function, which is a part of the sigmoid function family. These functions are widely used in the natural sciences because they provide the simplest model for population growth. However, the reason why the logistic function is used for classification in Machine Learning is its 'S-shape'.
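To see this shape, we can plot the logistic function (a quick sketch, assuming numpy and matplotlib are installed):
In [ ]:
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 200)
h = 1.0 / (1.0 + np.exp(-z))                      # the logistic function
plt.plot(z, h)
plt.axhline(0.5, color='gray', linestyle='--')    # the decision boundary discussed below
plt.xlabel('z')
plt.ylabel('h(z)')
plt.show()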
As you can see, this function is bounded in the $y$-direction by 0 and 1. If the variable $ z$ is very negative, the output goes to zero (the datapoint does not belong to the class); if $ z$ is very positive, the output goes to one and it does belong to the class. (In these limits the logistic function behaves like an indicator function.)
The question then is, what will happen to input values which are neither very positive nor very negative, but somewhere 'in the middle'. We have to define a decision boundary, which separates the positive from the negative class. Usually this decision boundary is chosen at the middle of the logistic function, namely at $ z = 0 $ where the output value $ y$ is $ 0.5$.
\begin{equation} y =\begin{cases} 1, & \text{if } z > 0,\\ 0, & \text{if } z < 0. \end{cases} \end{equation}
As we can see in the formula of the logistic function, $ z = \theta \cdot x $. That is, the parameter $ \theta$ (the weight given to the input feature) maps the input variable $ x$ to a position on the $ z$-axis. With this $ z$-value, we can use the logistic function to calculate the $ y$-value. If this $ y$-value $ \gt 0.5 $ we assume the example does belong to the class, and vice versa.
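As a small numeric illustration (with made-up values): if $ \theta = 1.2 $ and $ x = 2 $, then $ z = \theta \cdot x = 2.4 $ and $ h_{\theta}(x) = 1/(1+e^{-2.4}) \approx 0.92 > 0.5 $, so this example is assigned to the class ($ y = 1 $); for $ x = -1 $ we get $ z = -1.2 $ and $ h_{\theta}(x) \approx 0.23 < 0.5 $, so $ y = 0 $.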
So the parameter values $ \theta $ should be chosen such that the model predicts class membership correctly. It is therefore essential to know which input features are useful for the classification task. Once the appropriate features are selected, gradient descent can be used to find the optimal values of $ \theta $.
How can we do gradient descent with this logistic function? Except for the hypothesis function having a different form, the gradient descent method is exactly the same. We again have a cost function, of which we iteratively take the gradient with respect to $ \theta $ and update the parameter values at each iteration.
This cost function is the negative log-likelihood, better known as the binary cross-entropy:
\begin{equation} \begin{split} J(x) = -\frac{1}{2n} \sum_i^n \left( y^{(i)} \log( h_{\theta}(x^{(i)})) + (1-y^{(i)})\log(1-h_{\theta}(x^{(i)})) \right) \\ = -\frac{1}{2n} \sum_i^n \left( y^{(i)} \log\left(\frac{1}{1+\exp(-\theta x^{(i)})}\right) + (1-y^{(i)})\log\left(1-\frac{1}{1+\exp(-\theta x^{(i)})}\right) \right) \end{split} \end{equation}
We know that:
$ \log\left(\frac{1}{1+\exp(-\theta x^{(i)})}\right) = \log(1) - \log(1+\exp(-\theta x^{(i)})) = - \log(1+\exp(-\theta x^{(i)}))$
and
$ \log\left(1-\frac{1}{1+\exp(-\theta x^{(i)})}\right) = \log\left( \frac{\exp(-\theta x^{(i)})}{1+\exp(-\theta x^{(i)})}\right) $
$ = \log(\exp(-\theta x^{(i)})) - \log(1+\exp(-\theta x^{(i)})) $
$ = -\theta x^{(i)} - \log(1+\exp(-\theta x^{(i)})) $
Plugging these two equations back into the cost function gives us:
\begin{equation} \begin{split} J(x) = - \frac{1}{2n} \sum_i^n \left( - y^{(i)} \log(1+\exp(-\theta x^{(i)})) - (1-y^{(i)})\left(\theta x^{(i)} + \log(1+\exp(-\theta x^{(i)}))\right) \right) \\ = - \frac{1}{2n} \sum_i^n \left( y^{(i)} \theta x^{(i)} -\theta x^{(i)} -\log(1+\exp(-\theta x^{(i)})) \right) \end{split} \end{equation}
The gradient of the cost function with respect to $ \theta $ is given by
\begin{align} \frac{d}{d\theta} J(x) = - \frac{1}{2n} \sum_i^n \left( y^{(i)} x^{(i)} - x^{(i)} + x^{(i)} \frac{ \exp(-\theta x^{(i)})}{1+\exp(-\theta x^{(i)})} \right) \\ = - \frac{1}{2n} \sum_i^n x^{(i)} \left( y^{(i)} - 1 +\frac{\exp(-\theta x^{(i)})}{1+\exp(-\theta x^{(i)})} \right) \\ = - \frac{1}{2n} \sum_i^n x^{(i)} \left( y^{(i)} - \frac{1}{1+\exp(-\theta x^{(i)})} \right) \\ = - \frac{1}{2n} \sum_i^n x^{(i)} \left( y^{(i)} - h_{\theta}(x^{(i)}) \right) \end{align}
So the gradient of the seemingly difficult cost function turns out to be a much simpler expression. And since it is this gradient we use to update the values of $\theta$, this makes our work much easier.
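As a quick sanity check (not part of the original derivation), the simplified analytic gradient can be compared against a numerical finite-difference gradient of the cross-entropy cost; all names and the random test data below are purely illustrative:
In [ ]:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, Y):
    # binary cross-entropy (with a 1/n factor; the 1/(2n) used in the text only rescales it)
    h = sigmoid(X.dot(theta))
    return -np.mean(Y * np.log(h) + (1 - Y) * np.log(1 - h))

def analytic_grad(theta, X, Y):
    # the simplified gradient derived above: -1/n * sum_i x_i (y_i - h_theta(x_i))
    h = sigmoid(X.dot(theta))
    return -X.T.dot(Y - h) / len(Y)

rng = np.random.default_rng(0)
X_chk = rng.normal(size=(50, 3))
Y_chk = (rng.random(50) < 0.5).astype(float)
theta_chk = rng.normal(size=3)

eps = 1e-6
num_grad = np.array([(cost(theta_chk + eps * np.eye(3)[j], X_chk, Y_chk) -
                      cost(theta_chk - eps * np.eye(3)[j], X_chk, Y_chk)) / (2 * eps)
                     for j in range(3)])
print(np.max(np.abs(num_grad - analytic_grad(theta_chk, X_chk, Y_chk))))   # should be close to zero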
Gradient descent for Logistic Regression is again performed in the same way:
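Written out explicitly with the simplified gradient derived above, the parameter update at each iteration reads:
\begin{equation} \theta := \theta - \alpha \frac{d}{d\theta} J(x) = \theta + \frac{\alpha}{2n} \sum_i^n x^{(i)} \left( y^{(i)} - h_{\theta}(x^{(i)}) \right) \end{equation}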
Let's generate some data points. There are $n = 300$ students participating in the course Machine Learning, and whether a student $ i $ passes ($ y_i = 1 $) or not ($ y_i = 0 $) depends on two variables: the number of hours studied ($ x_i^{(1)} $) and the number of hours slept ($ x_i^{(2)} $).
In our example the outcome is fully deterministic: everyone who has studied less than 4 hours fails the course, as does everyone whose studying time plus sleeping time is at most 13 hours ($x_i^{(1)} + x_i^{(2)} \leq 13$).
In [2]:
import random
import numpy as np
def func2(x_i):
    # A student fails (y = 0) if he/she studied less than 4 hours,
    # or if studying time plus sleeping time is at most 13 hours.
    if x_i[1] < 4 or x_i[1] + x_i[2] <= 13:
        return 0
    else:
        return 1

def generate_data2(no_points):
    X = np.zeros(shape=(no_points, 3))
    Y = np.zeros(shape=no_points)
    for ii in range(no_points):
        X[ii][0] = 1                          # bias term
        X[ii][1] = random.random()*9 + 0.5    # hours studied, in [0.5, 9.5]
        X[ii][2] = random.random()*9 + 0.5    # hours slept, in [0.5, 9.5]
        Y[ii] = func2(X[ii])
    return X, Y

X, Y = generate_data2(300)
The result looks like this (the green dots indicate a pass and the red dots a fail):
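Such a plot can be generated along the following lines (a sketch, assuming matplotlib is available; recall that column 0 of X is the bias term):
In [ ]:
import matplotlib.pyplot as plt

colors = ['green' if y == 1 else 'red' for y in Y]
plt.scatter(X[:, 1], X[:, 2], c=colors)
plt.xlabel('hours studied')
plt.ylabel('hours slept')
plt.show()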
We have a LogisticRegression class, which sets the values of the learning rate $\alpha$ and the maximum number of iterations max_iter at its initialization.
The values of X, Y are set when these matrices are passed to the “train()” function, and then the values of no_examples, no_features, and theta are determined.
We also have the hypothesis, cost and gradient functions.
The gradient descent method uses these methods to update the values of $\theta$.
The “train()” method, first sets the values of the matrices X, Y and theta, and then calls the gradient_descent method.
Once the $\theta$ values have been determined with the gradient descent method, the training of the classifier is complete and we can use it to classify new examples.
In [5]:
import numpy as np
class LogisticRegression():
    """
    Class for performing logistic regression.
    """
    def __init__(self, learning_rate=0.7, max_iter=1000):
        self.learning_rate = learning_rate
        self.max_iter = max_iter
        self.theta = []
        self.no_examples = 0
        self.no_features = 0
        self.X = None
        self.Y = None

    def add_bias_col(self, X):
        # prepend a column of ones so that theta_0 acts as the intercept
        bias_col = np.ones((X.shape[0], 1))
        return np.concatenate([bias_col, X], axis=1)

    def hypothesis(self, X):
        # logistic function h_theta(x) = 1 / (1 + exp(-theta.x))
        return 1 / (1 + np.exp(-1.0 * np.dot(X, self.theta)))

    def cost_function(self):
        """
        We will use the binary cross entropy as the cost function. https://en.wikipedia.org/wiki/Cross_entropy
        """
        predicted_Y_values = self.hypothesis(self.X)
        cost = (-1.0 / self.no_examples) * np.sum(
            self.Y * np.log(predicted_Y_values) + (1 - self.Y) * np.log(1 - predicted_Y_values))
        return cost

    def gradient(self):
        # gradient of the cost w.r.t. theta: -1/n * sum x (y - h_theta(x));
        # the 1/(2n) factor used in the derivation only rescales the learning rate
        predicted_Y_values = self.hypothesis(self.X)
        grad = (-1.0 / self.no_examples) * np.dot((self.Y - predicted_Y_values), self.X)
        return grad

    def gradient_descent(self):
        for it in range(1, self.max_iter):
            cost = self.cost_function()
            delta = self.gradient()
            self.theta = self.theta - self.learning_rate * delta   # step against the gradient
            if it % 100 == 0:
                print("iteration %s : cost %s " % (it, cost))

    def train(self, X, Y):
        self.X = self.add_bias_col(X)
        self.Y = Y
        self.no_examples, self.no_features = np.shape(X)
        self.theta = np.ones(self.no_features + 1)   # initial guess for the parameters
        self.gradient_descent()

    def classify(self, X):
        X = self.add_bias_col(X)
        predicted_Y = self.hypothesis(X)
        predicted_Y_binary = np.round(predicted_Y)   # threshold at 0.5
        return predicted_Y_binary
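As a quick sanity check (not part of the original notebook), the classifier can first be trained on the generated student data from above. Since generate_data2 already prepends a bias column and train() adds its own, column 0 is dropped here; the learning rate is chosen ad hoc for this sketch:
In [ ]:
lr_toy = LogisticRegression(learning_rate=0.1, max_iter=1000)
lr_toy.train(X[:, 1:], Y)
predictions = lr_toy.classify(X[:, 1:])
print("training accuracy:", np.mean(predictions == Y))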
In [7]:
import pandas as pd
from evaluators import *
# map each iris species to a binary label, one mapping per one-vs-rest problem
to_bin_y = { 1: { 'Iris-setosa': 1, 'Iris-versicolor': 0, 'Iris-virginica': 0 },
             2: { 'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 0 },
             3: { 'Iris-setosa': 0, 'Iris-versicolor': 0, 'Iris-virginica': 1 }
           }

# loading the dataset and splitting it into a 70% train and 30% test set
datafile = '../datasets/iris/iris.data'
df = pd.read_csv(datafile, header=None)
df_train = df.sample(frac=0.7)
df_test = df.loc[~df.index.isin(df_train.index)]
X_train = df_train.values[:, 0:4].astype(float)
y_train = df_train.values[:, 4]
X_test = df_test.values[:, 0:4].astype(float)
y_test = df_test.values[:, 4]

# here we use mapping 3: Iris-virginica vs the rest
Y_train = np.array([to_bin_y[3][x] for x in y_train])
Y_test = np.array([to_bin_y[3][x] for x in y_test])
print("training Logistic Regression Classifier")
lr = LogisticRegression()
lr.train(X_train, Y_train)
print("trained")
predicted_Y_test = lr.classify(X_test)
f1 = f1_score(predicted_Y_test, Y_test, 1)
print("F1-score on the test-set for class %s is: %s" % (1, f1))
As you can see, our simple LogisticRegression class can classify the iris dataset with quite a high F1-score.